Abstract: We propose a new technique to infer the structure and
extract the data tokens from semi-structured web sources that are
generated from a consistent template or layout with some implicit
regularities. The attributes are extracted and labeled in reverse,
starting from the region of interest of the targeted content. This
is in contrast with existing techniques, which always generate the
trees from the root. We argue and show that our technique is
simpler, more accurate, and more effective, especially in detecting
changes in the templates of targeted web pages.

Index Terms: data extraction; data mining; web-based information
system



I. Introduction 

During the last decade, most websites have come to provide
information generated from structured data in an underlying
database through predefined templates or layouts. Given the great
number of web pages available on the net, these semi-structured web
sources contain rich and valuable data for a variety of purposes.
Extracting those data and rebuilding them into a structured
database is a key challenge in realizing automatic data mining from
web sources.

Several methods for these purposes have been proposed previously
in the literature. Some of them can be classified as so-called
wrappers, for instance [1], [2], [3], [4], [5], and have been
briefly surveyed in [10]. The wrapper technique allows automatic
data extraction through a predefined wrapper created for each
target data source. The wrapper then accepts a query against the
data source and returns a set of structured results to the calling
application.

On the other hand, there are several automatic methods that need
no manual initial learning process. For example, some methods
generate a template automatically from the first few pages before
extracting the rest of the data based on that template [6], [7],
[8]. A more comprehensive method that does not require multiple
pages has also been proposed; it uses a page creation model which
captures the main characteristics of semi-structured web pages to
derive the set of properties [9]. However, this last method is
mainly intended to extract lists of data records from a single web
page with sibling subtrees.

Recently we have worked on extracting semi-structured data from
targeted web pages with specific topics, the so-called focused
web-harvesting [11]. The method is particularly suitable for
'indirect data integration', which is not tolerant of any error.
The architecture is inspired by, and is a combination of, focused
web-crawling and regular web-harvesting. Focused web-crawling does
not crawl web pages indiscriminately like general-purpose search
engines, but attempts to download pages that are similar to each
other [12]. Because data integration requires high accuracy, the
focused web-harvesting technique adopts human intervention in the
initial setup: the administrator provides the targeted URL of a
list of data, for instance a publication list, and defines the
template of the final page that contains the relevant information
by labeling the attributes. In the case of the detail page of a
publication as shown in Fig. 1, the relevant information and labels
range from the title, authors and abstract to the full text if
available. The method has been applied to the Indonesian Scientific
Index (ISI), which integrates scientific data from scientific and
academic institutions across Indonesia through their websites [13].
The system is also open to the public under the GNU License at
SourceForge.net [14]. In spite of its high accuracy and the fact
that it needs no machine-learning-like system, the method suffers
from tedious manual labor at the initial setup to determine the
tokens and label the attributes, and it lacks the ability to detect
later changes of the targeted page templates effectively. This is
because the labels and tokens are represented as DOM trees, which
are sensitive to later changes of the targeted web templates [15],
[16].

In this paper, in order to overcome the above-mentioned problems,
we present a new method that extracts the tokens in reverse from
the region of interest (RoI) at the final web pages, and then
labels each attribute following the normal procedure. This
technique is in contrast with the existing mechanism, which always
starts from the root of the web source and also from the root of
the RoI. We argue that it is more efficient and accurate in
extracting the tokens, and that it can detect later template
changes without any ambiguity. We should also remark that the
reverse method is applicable to any existing data extraction
method, especially those that require an initial setup by human
intervention to define and label the attributes. It is actually
similar to the previous method [17], but instead of the PAT tree
[18] we use a DOM-tree-like mechanism [16]. Moreover, we improve
its accuracy by taking into account the lower part of the tree, and
not only the trees from the root.

The paper is organized as follows. After this brief introduction,
we describe our approach in Sec. II. In the subsequent section,
Sec. III, we present the implementation and the web interface used
to define the initial setup for each targeted website. Finally, we
summarize the paper and discuss some open issues and further
development.

Fig. 1. The example of the final RoI page of a publication on the web and its RoI's HTML source.

II. The reverse mechanism 

Whatever method is used to extract and label the tokens from a web
template or layout, a correct initial setup is crucial for further
data extraction. As mentioned before, this point plays an important
role in indirect data integration, which has no tolerance for
errors. This makes some methods based on machine learning
unsuitable for the purpose.

Note that from here on we deal neither with the algorithms to mine
the labeled data, since the tasks after labeling can be done with
any existing method, nor with finding the relevant pages of the
data list, which has been discussed in our previous work [11] and
in many other works. The reverse mechanism can be outlined
recursively as follows:

1) Determine the URL of the final web page with the desired
information, like Fig. 1.

2) Provide the whole text of the RoI by copying and pasting the
'displayed desired text'.

3) Provide the whole sentence(s) of each sub-RoI and assign the
attributes for each of them.

4) Crawl the source.

5) Parse the source and remove the text-format HTML tags such as
<b>, <i>, etc.

6) Take the upper part of the source, from the top up to the last
element before the first sentence of the RoI. Parse it and remove
all text inside, keeping only the layout-format HTML tags such as
<tr>. Do the same for the lower part, that is from the end of the
last sentence of the RoI to the bottom.

7) Calculate the number of 'open-tags' ($n_{ot}$) and 'closed-tags'
($n_{ct}$), starting from the deepest part in terms of the desired
content, that is from the tags nearest to the RoI.
We should stress here that there is no need for the administrators
to provide the web page sources at all. An open-tag here means a
tag which has no pair within the upper or lower part, while a
closed-tag denotes a tag that is paired within the upper or lower
part. Of course, our interest is only in the open-tags, which
describe the whole structure of the web template.
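
Steps 5) to 7) can be illustrated with a short sketch. The Python
code below is only a minimal illustration under stated assumptions,
not the ISI implementation: the function names, the list of
text-format tags, the list of void tags and the regex-based
tokenizer are all hypothetical, and the closed-tag count is taken
here as one per matched pair.

```python
import re

# Assumed, non-exhaustive list of text-format tags removed in step 5.
TEXT_TAGS = {"b", "i", "u", "em", "strong", "font"}
# Void elements never have a closing partner, so they are skipped.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}

# Matches an opening or closing tag and captures "/" and the tag name.
TAG_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>")


def strip_text_tags(html):
    """Step 5: drop text-format tags but keep the text they enclose."""
    def keep_or_drop(match):
        return "" if match.group(2).lower() in TEXT_TAGS else match.group(0)
    return TAG_RE.sub(keep_or_drop, html)


def split_around_roi(html, first_sentence, last_sentence):
    """Step 6: return the source above and below the RoI."""
    start = html.index(first_sentence)
    end = html.index(last_sentence, start) + len(last_sentence)
    return html[:start], html[end:]


def count_tags(part):
    """Step 7: return (n_ot, n_ct) for one part of the source.

    n_ot counts tags without a partner inside this part,
    n_ct counts pairs that open and close inside this part.
    """
    pending = []            # opening tags still waiting for a partner
    unpaired_closers = 0    # closing tags whose opener is outside this part
    n_ct = 0
    for match in TAG_RE.finditer(part):
        is_closer = match.group(1) == "/"
        name = match.group(2).lower()
        if name in VOID_TAGS:
            continue
        if not is_closer:
            pending.append(name)
            continue
        # Pair the closer with the most recent opener of the same name.
        for i in range(len(pending) - 1, -1, -1):
            if pending[i] == name:
                del pending[i]
                n_ct += 1
                break
        else:
            unpaired_closers += 1
    return len(pending) + unpaired_closers, n_ct


page = ("<html><body><table><tr><td><b>Menu</b></td></tr>"
        "<tr><td>A <i>Title</i>. Some abstract.</td></tr></table>"
        "<p>Footer</p></body></html>")
upper, lower = split_around_roi(strip_text_tags(page),
                                "A Title.", "Some abstract.")
print(count_tags(upper))  # (5, 2)
print(count_tags(lower))  # (5, 1)
```

In this toy page the five unpaired tags of the upper part (html,
body, table, tr, td) are exactly the ones closed in the lower part,
which is the pairing between both parts that the method relies on.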

Following the above procedure, we obtain a kind of DOM tree as
shown in Fig. 2. We can further calculate the number of trees
according to the number of open-tags. Note that the calculation is
done horizontally, from left to right as shown by the arrow in the
figure. The number of trees in the upper and lower parts is
determined by

\[ \Sigma \equiv n_{ot} - n_{ct} . \qquad (1) \]

Considering all possibilities for the number of trees in the upper
and lower parts, we can categorize web structures in general
through the discrepancy between the two numbers as

\[
\Delta = \Sigma_{upper} - \Sigma_{lower}
\begin{cases}
= 0 : & \text{fully symmetric} \\
< 0 : & \text{lower asymmetry} \\
> 0 : & \text{upper asymmetry}
\end{cases}
\qquad (2)
\]

Fig. 2. The tree for the example of RoI given in Fig. 1. The dashed box denotes the RoI.

Fig. 2 provides an example of the tree for the case of Fig. 1,
which happens to be asymmetric. That means the numbers of trees in
the upper and lower parts are not the same,
$\Sigma_{upper} \neq \Sigma_{lower}$. Again, we can use one of the
existing methods to calculate the number of trees, like the PAT
tree algorithm [18] and so forth.
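
The categorization can be written down in a few lines. The sketch
below assumes that Eq. (1) reads $\Sigma = n_{ot} - n_{ct}$ and that
Eq. (2) reads $\Delta = \Sigma_{upper} - \Sigma_{lower}$ with the
three cases decided by the sign of $\Delta$; the function names are
hypothetical and the input counts are made-up illustrative values.

```python
def tree_count(n_ot, n_ct):
    """Number of trees in one part, following the assumed Eq. (1)."""
    return n_ot - n_ct


def classify_structure(sigma_upper, sigma_lower):
    """Return (delta, category) following the assumed Eq. (2)."""
    delta = sigma_upper - sigma_lower
    if delta == 0:
        category = "fully symmetric"
    elif delta < 0:
        category = "lower asymmetry"
    else:
        category = "upper asymmetry"
    return delta, category


# Illustrative counts only: upper part (n_ot=5, n_ct=2),
# lower part (n_ot=5, n_ct=1).
sigma_upper = tree_count(5, 2)   # 3
sigma_lower = tree_count(5, 1)   # 4
print(classify_structure(sigma_upper, sigma_lower))  # (-1, 'lower asymmetry')
```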

From the discussion above, it is clear that the present method has
several advantages:

• We can treat independently the structure and rules used to obtain
the RoI and the structure inside it.

• We can detect template changes and their relevance to the desired
RoI, since we can compare the pairing tokens between the upper and
lower parts.

We discuss these points in more detail through the real
implementation at ISI in the subsequent section.

III. The implementation

Our approach to web page information extraction has been
experimentally implemented in the ISI system. Like wrapper
induction programs, which are usually supplemented by a
user-friendly GUI, we have also developed a web-based interface to
perform the initial setup. The example used at ISI is given in
Fig. 3, which shows the web interface used to define the content of
the RoI from a chosen final page and the labels for each attribute.

ISI can now efficiently detect changes of the targeted web
templates by comparing the old and new values of $\Delta$. This is
done by executing the check procedure each time before a new crawl
of the same targeted web pages. Since the parameters shown in
Fig. 3 have been stored in the system, they can be used not only
for the initial setup but also for rechecking the templates on a
regular basis.

The discrepancy between the old and new values of $\Delta$ is
useful for detecting template changes over time, and at the same
time for determining where the changes happened. The cases can be
summarized as below:

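1) No change at all:

\[ \Delta^{new} = \Delta^{old} ; \quad \Sigma^{new}_{upper} = \Sigma^{old}_{upper} ; \quad \Sigma^{new}_{lower} = \Sigma^{old}_{lower} \]

2) Simultaneous changes with the same size in both upper and lower
trees:

\[ \Delta^{new} = \Delta^{old} ; \quad \Sigma^{new}_{upper} \neq \Sigma^{old}_{upper} ; \quad \Sigma^{new}_{lower} \neq \Sigma^{old}_{lower} \]

3) Only one tree has changed:

\[ \Delta^{new} \neq \Delta^{old} ; \quad \Sigma^{new}_{upper} \neq \Sigma^{old}_{upper} ; \quad \Sigma^{new}_{lower} = \Sigma^{old}_{lower} \]

or:

\[ \Delta^{new} \neq \Delta^{old} ; \quad \Sigma^{new}_{upper} = \Sigma^{old}_{upper} ; \quad \Sigma^{new}_{lower} \neq \Sigma^{old}_{lower} \]

4) Both trees have changed differently:

\[ \Delta^{new} \neq \Delta^{old} ; \quad \Sigma^{new}_{upper} \neq \Sigma^{old}_{upper} ; \quad \Sigma^{new}_{lower} \neq \Sigma^{old}_{lower} \]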



Fig. 3. The web interface to define the RoI and to label the attributes.

Evidently, in case 1 there is no need to alter the saved initial
parameters. In contrast, in cases 2, 3 and 4 we can deduce that the
template has been changed, either in the upper tree, the lower tree
or both. The important point is that, once a template change has
been detected, the system automatically replaces the old version of
the template with the new one.
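
The four cases can be checked mechanically once the old and new
tree counts are stored. The following sketch is one possible
reading, assuming $\Delta = \Sigma_{upper} - \Sigma_{lower}$; the
class TemplateSignature and the function detect_change are
hypothetical names introduced for illustration and are not part of
the ISI code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TemplateSignature:
    """Stored tree counts of one template (hypothetical container)."""
    sigma_upper: int
    sigma_lower: int

    @property
    def delta(self):
        # Assumed definition: discrepancy between upper and lower counts.
        return self.sigma_upper - self.sigma_lower


def detect_change(old, new):
    """Return the case number (1 to 4) from the list of cases."""
    upper_changed = old.sigma_upper != new.sigma_upper
    lower_changed = old.sigma_lower != new.sigma_lower
    if not upper_changed and not lower_changed:
        return 1                      # no change at all
    if upper_changed and lower_changed:
        # Same-size changes leave delta untouched (case 2),
        # different-size changes alter it (case 4).
        return 2 if old.delta == new.delta else 4
    return 3                          # only one tree has changed


stored = TemplateSignature(sigma_upper=3, sigma_lower=4)
print(detect_change(stored, TemplateSignature(3, 4)))  # 1
print(detect_change(stored, TemplateSignature(5, 6)))  # 2
print(detect_change(stored, TemplateSignature(5, 4)))  # 3
print(detect_change(stored, TemplateSignature(5, 7)))  # 4
```

Any result other than 1 would trigger the replacement of the stored
signature by the new one before crawling continues.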

Furthermore, the reverse method can be used to rebuild the content
of the RoI in terms of labels, which enables us to restructure the
database for further purposes. The procedure is exactly the same as
for extracting the RoI. Tab. I shows the delimiters extracted from
the RoI in Fig. 1 using the same interface as in Fig. 3.

IV. Summary 

We have discussed a simple method based on the reverse algorithm
and the DOM tree to extract the RoI and label the relevant
attributes in the initial setup. The resulting patterns can then be
used to automatically extract the data from crawled targeted web
pages. We argue that the method and its web interface significantly
reduce the administrator's work, while also improving the accuracy
and speed of finding the tokens and labeling the attributes. We
have found that this method is very effective in detecting template
changes, for instance a newly inserted advertorial in the middle of
the upper or lower tree, which occurs often on websites and causes
difficulties for existing methods.

The experimental work on applying the method to the huge amount of
data stored at ISI is still in progress. The results and the
method's effectiveness in detecting altered web pages will be
analysed and published in a more complete and detailed paper
elsewhere. However, according to our trial experiments using 10000
data records from several web sources, the method performs very
well. It succeeded in detecting all template changes and improved
the speed of the whole process by up to 20%.

Finally, we should remark that in principle the method is also
applicable to web sources in the form of a list of data. The work
on this matter is also in progress.